Abstract

We describe a new method to accelerate neighbor searches on GRAPE,
special-purpose hardware that efficiently calculates gravitational
forces and potentials in N-body simulations. In addition to the
gravitational calculations, GRAPE simultaneously constructs the lists
of neighbor particles that are necessary for Smoothed Particle
Hydrodynamics (SPH). However, transferring the neighbor lists from
GRAPE to the host computer is time consuming and can be a bottleneck;
in fact, it can take about as long as the force calculations
themselves. Because of the way GRAPE stores neighbor lists, the amount
of transferred data is reduced if particles are searched in an order
such that the neighbor lists constructed in a single GRAPE run overlap
each other. We find that Morton ordering requires very low additional
calculation and programming costs, and successfully speeds up the data
transfer. We show benchmark results for GRAPE-5. The transferred data
are typically reduced by as much as 90%. This method is suitable not
only for GRAPE-5, but also for GRAPE-3 and the other versions of GRAPE.

1. Introduction

GRAPE (GRAvity PipE; Sugimoto et al. 1990) is special-purpose hardware
that efficiently calculates Newtonian gravitational forces in
large-scale N-body simulations. The GRAPE series was developed by Ito
et al. (1990; GRAPE-1), Ito et al. (1991; GRAPE-2), Okumura et al.
(1993; GRAPE-3), Makino et al. (1997; GRAPE-4), and Kawai et al.
(2000; GRAPE-5). GRAPE is connected to, and controlled by, a typical
workstation or PC. The host computer requests GRAPE to calculate
gravitational forces; force integration and particle pushing are all
done on the host computer. Owing to its high performance, GRAPE has
been a powerful tool for solving astronomical N-body problems, such as
those related to star cluster evolution (Makino 1996), black hole
spiral-in (Makino & Ebisuzaki 1996), formation of planets (Kokubo &
Ida 1998), and formation of central cusps in dark matter halos
(Fukushige & Makino 1997).

In addition to the efficient gravitational calculations, GRAPE gathers
neighbor particles in parallel, and returns neighbor lists to the host
computer if requested. Since searching for neighbors is one of the most
time-consuming routines in particle simulations with close
interactions, such as Smoothed Particle Hydrodynamics (SPH), the
neighbor lists from GRAPE are advantageous in speeding up those
simulations. Thus SPH is often combined with N-body calculations using
GRAPE. Pioneering work on the GRAPE-SPH method was done by Umemura et
al. (1993), using GRAPE-1A. Steinmetz (1996) reported the high
performance of the GRAPE-SPH method using GRAPE-3. The GRAPE-SPH
method has been successfully applied to a number of topics, e.g.,
fragmentation of molecular clouds (Klessen 1997), and galaxy formation
(Steinmetz & Müller 1995; Weil et al. 1998; Mori et al. 1999; Koda et
al. 2000a; Koda et al. 2000b).

Despite the high performance of GRAPE-SPH, the neighbor search is
still a costly routine in full calculations (Steinmetz 1996). In
particular, the transfer of neighbor lists between GRAPE and the host
computer is a bottleneck. Steinmetz (1996) pointed out that, owing to
the specification of GRAPE, the amount of transferred data is reduced
if the neighbor lists returned from GRAPE in one call overlap each
other (see §2 for details). In the case of GRAPE-3, which constructs 8
neighbor lists for 8 particles simultaneously, if the lists are
completely identical, the communication time for the 8 lists is as
short as that for a single list. In order to make the lists overlap at
least partially, Steinmetz (1996) sorted the particles in X-coordinate
order before the GRAPE call, because particles with similar neighbor
lists must have similar positions, and thus similar X-coordinates.
This approach reduced the time consumed in entire neighbor searches by
10-20%, in simulations using a few tens of thousands of particles and
GRAPE-3.

T.R.Saitoh and J.Koda

This approach, however, becomes less effective when the number of
particles increases, because particles with similar X-coordinates are
more likely to have very different Y, Z-coordinates. In this paper, we
introduce another ordering method for GRAPE neighbor searches, the
Morton ordering method, which keeps track of the original
3-dimensional particle coordinates, and whose efficiency is
independent of the number of particles. We show some test calculations
of neighbor searches using GRAPE-5 and Morton ordering. The Morton
ordering method has been suggested for some parallel tree algorithms
for N-body simulations (Barnes & Hut 1989; Warren & Salmon 1995), and
is now applied to searching neighbors in GRAPE-SPH.

We briefly review some GRAPE hardware specifications related to
searching neighbors in §2, and Morton ordering in §3. Test
calculations and results are shown in §4 and §5, respectively. A
summary appears in §6.

2. GRAPE: Specification for Neighbor Search 

GRAPE is a special-purpose board, similar to a graphics board, used to
accelerate gravitational force calculations in N-body problems. It is
connected to, and controlled by, a host computer, i.e. a typical
workstation or PC. The host computer sends particle positions and
masses to GRAPE, and GRAPE calculates gravitational forces and
potentials and sends them back to the host computer. GRAPE can also
return neighbor particle lists, and is thus suitable for simulations
with close interactions, such as SPH simulations. The transfer of the
neighbor lists, however, takes much more time than that of the other
data: for a single particle, typically sixty words must be transferred
for a neighbor list, while only ten words are transferred for the
mass, position, force, potential, gravitational softening, and radius
of the particle. We introduce a method to speed up this data transfer
in §3. Though we describe the case of the fifth version of the GRAPE
series (GRAPE-5), the method is also well suited for GRAPE-3 and the
other versions of GRAPE. The detailed design of the GRAPE-5 hardware
is described in Kawai et al. (2000). Hence, we give only a brief
review and the details necessary for our new neighbor searching
method.

The main engine of a GRAPE-5 board is composed of G5 chips. The G5
chip is a custom LSI chip that calculates gravitational forces and
potentials. One GRAPE-5 board has eight G5 chips, each of which
calculates the forces on 12 particles simultaneously; hence one board
calculates the forces on 8 x 12 = 96 particles at once. The
gravitational force $\mathbf{f}_i$ on a particle $i$ is derived by
first calculating the force $\mathbf{f}_{ij}$ between two particles
($i$ and $j$), and then summing over all particles,
$\mathbf{f}_i = \sum_j \mathbf{f}_{ij}$. The G5 chip can also check
whether a particle $j$ is a neighbor of a particle $i$, by comparing
the square of the distance $r_{ij}$ between $i$ and $j$ with the
square of the radius $h_i$ of $i$, as

$$ r_{ij}^2 < h_i^2 . $$
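The neighbor criterion above can be illustrated on the host side with a short Python sketch. This is not the G5 pipeline itself, only a software analogue of the same test; the names `neighbor_list`, `positions`, and `h` are hypothetical, and excluding the particle itself from its own list is an assumption made for this illustration.

```python
def neighbor_list(i, positions, h):
    """Return the indices j with r_ij^2 < h_i^2 for particle i.

    positions: list of (x, y, z) tuples; h: list of search radii.
    Assumption for this sketch: particle i is not its own neighbor.
    """
    xi, yi, zi = positions[i]
    h2 = h[i] * h[i]  # compare squared quantities, so no square root is needed
    result = []
    for j, (xj, yj, zj) in enumerate(positions):
        if j == i:
            continue
        r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
        if r2 < h2:
            result.append(j)
    return result
```

For example, with particles at (0,0,0), (0.1,0,0), and (1,0,0) and a common radius of 0.5, particle 0 has the single neighbor 1, and particle 2 has none.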

The neighbor lists output from the G5 chips are stored in special
memories on the GRAPE-5 board. There are two memory units on a single
GRAPE-5 board, each of which stores the neighbor lists from four of
the eight G5 chips on the board, and thus for 4 x 12 = 48 particles.
GRAPE-5 does not keep the neighbor lists in a simple lengthy manner,
in which all neighbors of the 48 particles occupy individual memory
space, but as a combination of particle indices and flags. For
example, when one particle has a neighbor list of (8,11,22,41,49) and
another has (3,7,11,23,41), these two lists are kept as a convolved
particle index list (3,7,8,11,22,23,41,49) and binary flags
(01,01,10,11,10,01,11,10), where each bit of a flag tells whether the
index belongs to the list of the corresponding particle. The host
computer receives these indices and flags from GRAPE, and deconvolves
them into individual neighbor lists for the two particles. In
practice, each of the two neighbor memory units on a GRAPE-5 board
stores the neighbor lists for 48 particles, so the convolved list and
binary flags are made for 48 particles.
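The convolved representation can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above, not the GRAPE-5 firmware or library interface; the function names `merge_neighbor_lists` and `split_neighbor_lists` are hypothetical, and flags are written as bit strings for readability.

```python
def merge_neighbor_lists(lists):
    """Convolve per-particle neighbor lists into (indices, flags).

    indices: sorted union of all neighbor indices.
    flags:   one bit string per index; bit k is '1' if the index is a
             neighbor of the k-th particle.
    """
    sets = [set(lst) for lst in lists]
    indices = sorted(set().union(*sets))
    flags = ["".join("1" if j in s else "0" for s in sets) for j in indices]
    return indices, flags


def split_neighbor_lists(indices, flags):
    """Deconvolve (indices, flags) back into per-particle neighbor lists."""
    n_particles = len(flags[0]) if flags else 0
    lists = [[] for _ in range(n_particles)]
    for j, flag in zip(indices, flags):
        for k, bit in enumerate(flag):
            if bit == "1":
                lists[k].append(j)
    return lists
```

Applied to the two lists of the example in the text, `merge_neighbor_lists` yields the index list (3,7,8,11,22,23,41,49) with flags (01,01,10,11,10,01,11,10), and `split_neighbor_lists` recovers the two original lists. Only 8 entries are transferred instead of 10, because indices 11 and 41 are shared.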

This particular treatment of the neighbor lists provides room to speed
up the data transfer, and hence the neighbor search. Consider the case
in which each particle has $n_s = 60$ neighbor particles. If the 48
particles have completely different neighbors, the data transferred
from GRAPE-5 to the host computer are 48 x 60 = 2880 words for the
convolved list, and 2880 words for the binary flags. On the other
hand, if we can arrange the 48 particles so that they have perfectly
identical neighbor lists, the amount of data is reduced to 60 words
for the convolved list and 60 words for the binary flags, which means
that the communication time is reduced to 1/48 of the original (see
footnote 1). Hence, we can reduce the communication time for searching
neighbors with GRAPE by arranging the 48 particles so that their
neighbors overlap significantly.

3. GRAPE Neighbor Search with Morton 
Ordering 

A GRAPE-5 board searches neighbors for 96 particles simultaneously in
a single GRAPE run. In large N-body simulations, the GRAPE run is
repeated N/96 times. Each of the two memory units on a GRAPE-5 board
keeps the neighbor lists for 96/2 = 48 particles in a single run, and
the lists are transferred from GRAPE to the host computer. According
to the GRAPE specifications in §2, we can reduce the amount of
transferred data if we choose the 48 particles, for a single memory
unit in a single run, so that their neighbors overlap each other. The
cost of the data transfer decreases as the fraction of the overlap
increases.

In order to make the neighbor lists overlap, we should choose 48
intrinsically neighboring particles, i.e. particles with similar
coordinates, for a single GRAPE run. Based on this idea, Steinmetz
(1996) sorted all the particles according to their X-coordinates, and
succeeded in reducing the communication cost between GRAPE-3 and a
host computer. This method expects that particles arranged by
X-coordinate have similar (X, Y, Z)-coordinates more frequently than
randomly distributed particles. However, it becomes less effective in
very large GRAPE-SPH simulations, because the radius h for searching
neighbors becomes smaller in larger simulations, and thus two
particles with similar X-coordinates more frequently have quite
different Y, Z-coordinates, which makes their separation larger than
h. Therefore, we suggest the use of Morton ordering, rather than
X-coordinate ordering. Morton ordering naturally maps the
(X, Y, Z)-coordinates onto a 1-D space while largely preserving the
original 3-D structure. Morton ordering has been suggested for a
parallel tree code for gravitational calculations (Barnes & Hut 1989).

Footnote 1. In actual operations, the particle index and binary flag
for a neighbor particle are not treated separately. GRAPE stores them
in a single 64-bit memory block: the higher 48 bits for the flag, and
the lower 16 bits for the index (see Kawai et al. 2000 for details).

In Morton ordering, the 3-D coordinates of a particle,
$(X, Y, Z) = (0.x_1x_2x_3\ldots,\ 0.y_1y_2y_3\ldots,\ 0.z_1z_2z_3\ldots)$,
are translated into a 1-D key, $0.x_1y_1z_1x_2y_2z_2\ldots$, and the
particles are then sorted according to those keys. Since the 1-D keys
retain most of the information on the original 3-D coordinates, two
particles with similar keys generally lie close to each other in 3-D
space as well. In actual operations, the key is constructed in binary,
and hence can be produced simply by bit shifts and additions. Thus the
costs of the key construction, both in computation and in coding by a
programmer, are quite low. The additional time for sorting is also
negligible. A demonstration of Morton ordering in a 2-D case for 3,000
particles is shown in Figure 1. The particles are randomly and
uniformly distributed in a unit circle, and connected in Morton order
(key order) with a single stroke of a pen. It is evident that Morton
ordering arranges the particles in such a way that those with similar
2-D coordinates lie close to each other in the 1-D key space;
particles with similar keys are therefore expected to have quite
similar neighbor lists. Hence Morton ordering is effective in making
the neighbor lists of the 48 particles overlap.
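The key construction by bit shifts can be sketched as follows. This is a minimal Python illustration rather than the code used for the benchmarks; the function names `morton_key` and `morton_order` and the choice of 10 bits per coordinate are assumptions, and the coordinates are assumed to be rescaled beforehand to the interval [0, 1).

```python
def morton_key(x, y, z, bits=10):
    """Interleave the leading `bits` binary digits of x, y, z in [0, 1).

    The result is the integer whose binary digits are
    x1 y1 z1 x2 y2 z2 ... (most significant digits first).
    """
    scale = 1 << bits
    ix = min(int(x * scale), scale - 1)
    iy = min(int(y * scale), scale - 1)
    iz = min(int(z * scale), scale - 1)
    key = 0
    for b in range(bits - 1, -1, -1):  # from the most significant digit down
        key = (key << 3) | (((ix >> b) & 1) << 2) \
                         | (((iy >> b) & 1) << 1) \
                         | ((iz >> b) & 1)
    return key


def morton_order(positions, bits=10):
    """Return particle indices sorted by Morton key."""
    return sorted(range(len(positions)),
                  key=lambda i: morton_key(*positions[i], bits=bits))
```

Particles would then be sent to GRAPE in the order returned by `morton_order`, so that each consecutive group of 48 particles is spatially compact. For a distribution centered on the origin, such as a unit sphere, each coordinate would first have to be shifted and scaled, e.g. (x + 1)/2.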

Figure 1 

4. Test Calculations 

We test the efficiency of the above new method (GRAPE + Morton
ordering), in comparison with two other methods using GRAPE. We
distribute particles in space, search neighbors for those particles
using GRAPE, and measure the time consumed by the neighbor search.
Before starting the GRAPE neighbor search, we arrange the particles
(1) in a random order (hereafter, R-ordering), i.e. with no
rearrangement, (2) in X-coordinate order (X-ordering), and (3) in
Morton order (M-ordering). The last two orderings actively make the 48
neighbor lists stored on a single GRAPE memory unit overlap each
other, which improves the efficiency of the GRAPE neighbor search as
discussed in §3.

In actual calculations, such as cosmological and galaxy formation
simulations, various density distributions appear. Matter is uniformly
distributed in the early stage of the Universe, is gradually assembled
and collapses by gravity, and then forms nearly isothermal objects.
Hence we adopt spherically symmetric density profiles
$\rho(r) \propto r^n$ with the index $n$ of 0.0 (uniform) and -2.0
(isothermal) for test calculations. The density profiles are
constructed by randomly and uniformly distributing particles in a unit
sphere, and stretching the distribution by means of a radial
coordinate transformation, i.e. $r_{\rm new} = r^{3/(3+n)}$. We also
test the Hernquist profile (Hernquist 1990) as a realistic density
model of a dark matter halo, i.e.

$$ \rho(r) \propto \frac{1}{r\,(r + a)^3} , $$

where the core and truncation radii are set to $a = 0.1$ and
$r_{\rm max} = 1.0$, respectively. The number of particles in the test
calculations is varied from 10,000 to 100,000 in steps of 10,000,
which is a realistic range for actual SPH simulations with the direct
$O(N^2)$ calculations of GRAPE-5. The neighbor search radius $h$ of
each particle is set to the distance of its $n_s$-th nearest particle,
and we set $n_s = 60$ unless stated otherwise. This definition of $h$
is often used in SPH calculations. We repeat the neighbor search 10
times for each test calculation, and average the measured times to get
benchmark results, since the timings fluctuate slightly between
individual runs.
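The radial transformation used to build the power-law profiles can be sketched as below. This is an illustrative Python version under stated assumptions, not the code used for the tests: the function name `sample_power_law_sphere`, the rejection sampling of the uniform sphere, and the fixed seed are all choices made for this example. The mapping $r_{\rm new} = r^{3/(3+n)}$ follows from requiring that the enclosed mass, which scales as $r^3$ for the uniform sphere, scale as $r_{\rm new}^{3+n}$.

```python
import math
import random


def sample_power_law_sphere(n_particles, index, seed=0):
    """Sample particles in a unit sphere with density rho ~ r**index.

    Points are first drawn uniformly in the unit sphere (rejection
    sampling), then each radius r is mapped to r**(3 / (3 + index)).
    Requires index > -3.
    """
    rng = random.Random(seed)
    exponent = 3.0 / (3.0 + index)
    points = []
    while len(points) < n_particles:
        x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0 or r > 1.0:
            continue  # outside the unit sphere (or exactly at the center)
        s = r ** exponent / r  # stretch the radius, keep the direction
        points.append((x * s, y * s, z * s))
    return points
```

For the isothermal case (index = -2) the enclosed particle number grows linearly with radius, so about half of the particles should lie inside r = 0.5; for the uniform case (index = 0) the fraction is about 1/8.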

For the test calculations, we use a single GRAPE-5 board connected to
a host computer with an Alpha 264 processor running at a clock
frequency of 833 MHz. This is one of the GRAPE systems in the Mitaka
Under Vineyard (MUV), run underground at the National Astronomical
Observatory of Japan.

5. Results 

5.1. Consumption Time 

Table 1 summarizes the times consumed in neighbor searches with GRAPE
using random ordering (R-ordering), X-coordinate ordering
(X-ordering), and Morton ordering (M-ordering). The tabulated times
include both the data transfers between GRAPE and the host computer,
and the calculations for searching neighbors in GRAPE. The three
orderings differ only in their data transfer times. Figure 2 shows the
corresponding plot for the isothermal density profile (n = -2), where
the times for the GRAPE calculations without data transfer are also
drawn as crosses. The differences between the crosses and the other
marks indicate the times for data transfer and its overhead.

- Table 1 - 

Generally, the consumption times show no clear difference among the
three density profiles, because GRAPE intrinsically performs $O(N^2)$
operations, which do not depend on the density profile. Hence we
hereafter discuss only the case of the density profile with index
n = -2. It is evident that M- and X-ordering work faster than
R-ordering for any N, and that M-ordering is more efficient than
X-ordering.

In our GRAPE system, M-ordering works twice as fast as R-ordering for
N = 10,000, while X-ordering is only 1.3 times as fast. M-ordering is
1.5 times faster than R-ordering for N = 50,000, while X-ordering is
1.1 times faster. Both M- and X-ordering apparently become less
effective for larger N in terms of total calculation time (Figure 2),
although M-ordering keeps its data-compression efficiency even for
larger N (see §5.2). This is because both orderings save only the
communication costs between GRAPE and the host computer, which are
$O(N)$ operations, whereas the calculations in GRAPE, which are
$O(N^2)$, become more dominant for larger N.

— Figure 2 — 

For N = 100,000, the largest number in our tests, X-ordering becomes
inefficient, i.e. it consumes almost the same amount of time as
R-ordering, while M-ordering is still 1.4 times faster than
R-ordering. Therefore, M-ordering is best suited for neighbor searches
with GRAPE.

5.2. Data Compression Factor 

In §2 we described how the communication time between GRAPE and a host
computer is reduced if the neighbor lists for 48 particles, kept in a
memory unit in a single GRAPE run, overlap each other. In order to
describe how much the lists overlap, we define the mean data
compression factor of neighbor lists as

$$ f = \frac{N^{\rm trans}}{N^{\rm total}} , $$

where $N^{\rm total}$ is the simple sum of the numbers of neighbors
for all $n_p$ particles, and $N^{\rm trans}$ is the number of
neighbors actually transferred from GRAPE to the host computer,
according to the GRAPE specifications (§2). We note that this factor
does not depend on the speed of the host computer or of the interface
between GRAPE and the host computer. The communication time is
proportional to $f$.

If we consider a single GRAPE run in which each of the $n_p = 48$
particles has 60 neighbors, $N^{\rm total}$ is 48 x 60 = 2880. If the
neighbors of the 48 particles are completely independent,
$N^{\rm trans}$ is also 48 x 60 = 2880, and the compression factor
becomes $f = 1$, meaning no compression. If the neighbor lists are
perfectly identical, $N^{\rm trans}$ becomes 60 as described in §2,
and the compression factor takes its theoretical minimum,
$f = 1/n_p$ (note that this is an unrealistically ideal case; see
Appendix 2). For test calculations with a large number of particles,
the GRAPE run must be repeated many times; we then average $f$ over
all the neighbor-list transfers in all the runs.
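The definition of $f$ can be made concrete with a small Python sketch. The function name `compression_factor` and the data layout (a list of runs, each run being the neighbor lists held by one memory unit) are hypothetical, and the average over runs is taken here as the ratio of the summed counts, which is an assumption; it coincides with the mean of the per-run factors when every run has the same $N^{\rm total}$.

```python
def compression_factor(runs):
    """Mean data compression factor f = N_trans / N_total.

    runs: non-empty list of GRAPE runs; each run is a list of neighbor
          lists (one list of particle indices per particle in the
          memory unit).
    """
    n_total = 0  # simple sum of all neighbor-list lengths
    n_trans = 0  # entries actually transferred (merged list per run)
    for lists in runs:
        n_total += sum(len(lst) for lst in lists)
        merged = set()
        for lst in lists:
            merged.update(lst)
        n_trans += len(merged)
    return n_trans / n_total
```

With 48 identical lists of 60 neighbors the function returns 1/48, with 48 disjoint lists it returns 1, and for the two-particle example of §2 it returns 8/10.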

Table 2 lists $f$ for all the test calculations, and the corresponding
plot for the isothermal profile (n = -2) is presented in Figure 3. The
compression factors $f$ of R- and X-orderings increase with the number
of particles N. The $f$ for X-ordering is as small as 0.5 for
N = 10,000; however, it increases to about 0.8 for N = 100,000.
R-ordering shows almost no data compression ($f = 0.98$), that is, 98%
of the data are still transferred in the case of N = 100,000. Hence
X- and R-orderings do not work well for large-N calculations. On the
other hand, M-ordering keeps $f$ almost constant at the low value of
0.13 for all N (Figure 3). This is why M-ordering is still effective
in large-N calculations. The low value of $f = 0.13$ means that the
neighbor lists stored simultaneously on a single GRAPE memory unit
overlap almost perfectly (87% of the data need not be transferred),
and thus implies that there is little room for further improvement.
Therefore we conclude that our new method (GRAPE + Morton ordering) is
the best for neighbor searches using GRAPE.
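For reference, the spherical estimate of Appendix 2 can be evaluated directly. The sketch below assumes the form $f = (n_p^{1/3} + n_s^{1/3})^3 / (n_p n_s)$, which follows from the appendix's radii $r_p$, $r_s$ and $r_a = r_p + r_s$; the function name `ideal_compression_factor` is hypothetical.

```python
def ideal_compression_factor(n_p=48, n_s=60):
    """Spherical estimate of f for n_p particles with n_s neighbors each."""
    n_trans = (n_p ** (1.0 / 3.0) + n_s ** (1.0 / 3.0)) ** 3
    return n_trans / (n_p * n_s)
```

For $n_p = 48$ and $n_s = 60$ this gives $f \approx 0.15$, close to the measured 0.13 for M-ordering, and the estimate decreases toward $1/n_p$ as $n_s$ increases.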

— Table 2 - 

— Figure 3 — 

5.3. Dependence on n s 

Figure 4 shows the $n_s$ dependence of the consumption time in the
case of the isothermal density profile (n = -2) and N = 100,000. We
tested the range $n_s = 30$ to 120, which is used in actual SPH
calculations. Basically, the consumption times increase with $n_s$,
because the number of neighbor particles transferred from GRAPE to the
host increases with $n_s$. We note, however, that there is another
effect that suppresses the increase in time. A larger $n_s$ means
larger radii (volumes) of particles, and thus a larger overlap of
their neighbor lists. This reduces the data transfer, and saves time.
Figure 5 shows this effect: the compression factor $f$ decreases with
increasing $n_s$. In our realistic range of $n_s$, M-ordering shows
the best (smallest) compression factor, and is thus the best for any
$n_s$.

Figure 4 

— Figure 5 — 

6. Summary 

We have reviewed the specifications of the special-purpose hardware
GRAPE, and introduced a new method that speeds up neighbor searches in
large particle simulations using GRAPE. The main conclusions are the
following:

1. We introduced a new method, that is, arranging particles in Morton
order before performing GRAPE calculations. This method saves the
communication cost between GRAPE and its host computer. The cost of
additional programming is very low.

2. We compared this Morton-ordering method with some previous methods,
and conclude that the Morton-ordering method is much more effective.
In a case where the total particle number is N = 10,000, the
Morton-ordering method is twice as fast as a simple neighbor search
with GRAPE in our GRAPE system.

3. Communication between GRAPE and its host computer can be minimized
if the neighbor lists stored in a single GRAPE memory unit overlap
each other. The Morton-ordering method reduces the communication by
about 90%, thus leaving little room for further improvement.

4. The Morton-ordering method becomes less effective for larger
particle simulations, as do the other previous methods, because the
$O(N^2)$ calculations, other than data transfer, become dominant.
However, it is still efficient for simulations with N = 100,000.

5. The communication increases with the typical number of neighbor
particles. The Morton-ordering method is the best for any number that
is usually used in SPH calculations.

6. We showed the efficiency of the Morton-ordering method only for
GRAPE-5. However, it is also suitable for the other versions of GRAPE.
In fact, this method has been effectively used for galaxy formation
simulations using GRAPE-3 (Koda et al. 2000a; Koda et al. 2000b).

References

Weil, M.L., Eke, V.R., and Efstathiou, G. 1998, MNRAS, 300, 773

Appendix 1. Performance Estimation

The total calculation time for a neighbor search with GRAPE can be
modeled as

$$ T = T_h + T_g + T_t , \tag{A1} $$

where $T_h$, $T_g$, and $T_t$ stand for the time consumed on the host
computer, that on GRAPE, and that on data transfer of neighbor lists
from GRAPE to the host computer, respectively. $T_h$ and $T_g$ are
modeled as

$$ T_h = c_h N , \tag{A2} $$

$$ T_g = c_g N^2 , \tag{A3} $$

where $c_h$ and $c_g$ represent the miscellaneous calculation time per
particle on the host computer, and the time spent on a two-body
interaction on GRAPE, respectively. $N$ is the number of particles in
the calculation.

The total number of neighbor particles transferred from GRAPE to the
host computer is $N n_s f$, where $n_s$ is the typical number of
neighbors for one particle, and $f$ is the data compression factor
defined in §5.2. $T_t$ then takes the form

$$ T_t = c_t N n_s f , \tag{A4} $$

where $c_t$ is the time spent on data transfer per neighbor particle.
For our GRAPE system (GRAPE-5 and a host computer with an Alpha 264
processor at 833 MHz), we obtain the coefficients by fitting, and list
them in Table 3. Figure 6 shows a plot of estimated vs. measured $T$
for R-, X-, and M-orderings. Different symbols are used for different
orderings. All the points lie close to the proportional line (solid).
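The model of eqs. (A1)-(A4) is simple enough to evaluate by hand or in a few lines of Python. The sketch below uses the constants of Table 3, read here as $c_h = 1.8 \times 10^{-5}$ s, $c_g = 9.0 \times 10^{-10}$ s, and $c_t = 7.3 \times 10^{-7}$ s; the function name `estimated_time` is hypothetical.

```python
C_H = 1.8e-5   # host time per particle (s), Table 3
C_G = 9.0e-10  # GRAPE time per two-body interaction (s), Table 3
C_T = 7.3e-7   # transfer time per neighbor particle (s), Table 3


def estimated_time(n, f, n_s=60):
    """Estimated neighbor-search time T = T_h + T_g + T_t in seconds."""
    t_host = C_H * n
    t_grape = C_G * n * n
    t_transfer = C_T * n * n_s * f
    return t_host + t_grape + t_transfer
```

For N = 100,000 and $n_s = 60$, the model gives about 15.1 s with $f = 0.98$ (R-ordering) and about 11.4 s with $f = 0.13$ (M-ordering), compared with the measured 15.2 s and 11.6 s in Table 1. The GRAPE term alone contributes 9.0 s, which is why the gain from reordering shrinks at large N.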

- Table 3 - 



- Figure 6 -

Appendix 2. Theoretical Estimate of $f$

The data compression factor $f$ takes the minimum of $1/n_p$ in the
unrealistically ideal case in which all the $n_p$ particles have an
identical neighbor list; its actual minimum in a calculation, however,
would be larger. We estimate $f$ in a more realistic ideal case, and
compare it with the results of M-ordering. We consider the case in
which the $n_p$ particles (see §2) are selected very successfully,
i.e. the case in which their neighbor lists overlap almost as much as
possible. Since we have shown that $f$ does not depend on the
distribution of particles (§5), we assume that $N$ particles are
distributed in a unit sphere with a uniform density, i.e.
$\rho = 3N/4\pi$. In the ideal case, the $n_p$ particles themselves
must be closest neighbors of each other, and be clustered in a small
region. In the following we assume that this region has a spherical
form.

The $n_p$ particles are distributed in a sphere with the radius
$r_p = (n_p/N)^{1/3}$. If we take into account that some of the $n_p$
particles are on the surface of the sphere, and that each of them has
$n_s$ neighbors and a radius $r_s = (n_s/N)^{1/3}$, then all the
neighbors of the $n_p$ particles will be at least within a sphere of
the radius $r_a = r_p + r_s$. Hence the number of neighbor particles
stored in a GRAPE memory and transferred from GRAPE to the host
computer becomes

$$ N^{\rm trans} = \left( n_p^{1/3} + n_s^{1/3} \right)^3 . \tag{A5} $$

Since the total accumulated number of neighbors for the $n_p$
particles is $N^{\rm total} = n_p n_s$, $f$ is calculated as

$$ f = \frac{\left( n_p^{1/3} + n_s^{1/3} \right)^3}{n_p n_s} . \tag{A6} $$

This $f$ does not depend on $N$, and approaches $1/n_p$ when
$n_s \to \infty$. Figure 3 shows this $f$ (dotted line). The results
of M-ordering (squares) are close to the estimated $f$ of this ideal
case.

Note that we have assumed here that all the neighbors are closely
packed in an even spherical volume with a radius $r_a = r_p + r_s$.
However, this assumption is valid only in the limit of infinite $n_s$,
because the actual volume occupied by a small number $n_s$ of
particles has an uneven surface, which is completely enclosed by our
assumed sphere. Then $N^{\rm trans}$, and hence $f$, is smaller than
estimated by eq. (A5). Hence in Figure 3, M-ordering gives a slightly
smaller $f$ than the one estimated for the spherical case. This
difference becomes smaller with increasing $n_s$, which is confirmed
in Figure 5 (dotted line and squares).

Eq. (A6) gives a realistic minimum of $f$ that can occur in actual
calculations. The fair agreement of this minimum value with those from
M-ordering gives us confidence that M-ordering is the ideal method for
neighbor search in GRAPE-SPH.



Acceleration Method of Neighbor Search with GRAPE and Morton-ordering

Table 1. Time consumed by neighbor search

                  n = 0                  n = -2               Hernquist
 Number   R-ord. X-ord. M-ord.   R-ord. X-ord. M-ord.   R-ord. X-ord. M-ord.
  10000    0.61   0.45    ...      ...   0.45    ...      ...    ...    ...
  20000    1.40   1.18    ...     1.40   1.18   0.79     1.34   1.11   0.75
  30000    2.41   2.17   1.48     2.52   2.23   1.49     2.36   2.09   1.42
  40000    3.75   3.44   2.36     3.75   3.42   2.37     3.56   3.24   2.29
  50000    5.07   4.75   3.43     5.09   4.73   3.43     4.95   4.58   3.33
  60000    6.95   6.52   4.69     6.85   6.42   4.70     6.64   6.23   4.60
  70000    8.89   8.38   6.17     8.69   8.22   6.16     8.49   8.23   6.05
  80000    10.9   10.3   7.80     10.9   10.3   7.82     10.3   9.89   7.63
  90000    13.1   12.5   9.62     13.1   12.5   9.66     12.5   12.0   9.42
 100000    15.2   14.7   11.6     15.2   14.7   11.6     15.1   14.5   11.4

Test calculations for a unit sphere with a density profile of
$\rho \propto r^n$, and the Hernquist profile. Consumption time is
presented in units of seconds. Particles are randomly distributed, and
R-, X-, and M-ordering rearrange the particles in random order, in
X-coordinate order, and in Morton order, respectively, before GRAPE
calculations.



Table 2. Data Compression Rate $f$.

                  n = 0                  n = -2               Hernquist
 Number   R-ord. X-ord. M-ord.   R-ord. X-ord. M-ord.   R-ord. X-ord. M-ord.
  10000    0.85   0.48   0.12     0.85   0.47   0.13     0.85   0.47   0.13
  20000    0.92   0.60   0.13     0.92   0.59   0.13     0.93   0.59   0.14
  30000    0.95   0.67   0.13     0.95   0.66   0.13     0.95   0.65   0.14
  40000    0.96   0.72   0.13     0.96   0.70   0.13     0.96   0.70   0.14
  50000    0.97   0.75   0.13     0.97   0.73   0.13     0.97   0.73   0.14
  60000    0.97   0.77   0.13     0.97   0.76   0.13     0.97   0.76   0.14
  70000    0.98   0.78   0.13     0.98   0.77   0.13     0.98   0.77   0.14
  80000    0.98   0.79   0.13     0.98   0.78   0.13     0.98   0.78   0.14
  90000    0.98   0.81   0.13     0.98   0.79   0.13     0.98   0.79   0.14
 100000    0.98   0.82   0.13     0.98   0.80   0.13     0.98   0.80   0.14



Table 3. Timing constants for performance estimation

 Parameter   Constant (sec)
 c_h         1.8 x 10^-5
 c_g         9.0 x 10^-10
 c_t         7.3 x 10^-7

Fig. 1. Demonstration of the Morton-ordering in a 2-D case for
N = 3,000. Randomly distributed particles are connected in a Morton
order.

Fig. 2. Consumption time vs. number of particles, in the case of the
density profile index n = -2. Circles are times for R-ordering,
triangles are for X-ordering, and squares are for M-ordering. Crosses
indicate the time consumed for GRAPE calculations without data
transfer, which cannot intrinsically be suppressed in the above three
methods.

Fig. 3. Data compression factor vs. number of particles, in the case
of the density profile index n = -2. The same marks are used as in
Figure 2. The dotted line indicates a theoretical estimate of $f$ in
the case that the $n_p = 48$ particles are spherically distributed
(Appendix 2).

Fig. 4. $n_s$ dependence of consumption time for n = -2.0 and
N = 100,000. Circles are times for R-ordering, triangles are for
X-ordering, and squares are for M-ordering. The solid line indicates
the consumption time for GRAPE calculations without data transfer.

Fig. 5. $n_s$ dependence of the data compression factor $f$ for
n = -2.0. The same marks are used as in Figure 4. The dotted line
indicates a theoretical estimate of $f$ in the case that the
$n_p = 48$ particles are spherically distributed (see Appendix 2).

Fig. 6. Estimated vs. measured times consumed on neighbor search. The
same marks are used as in Figure 4. The solid line is a proportional
line. Estimated time is calculated by eq. (A1).



